Complete this quiz in a .Rmd file. To turn in the quiz, push both a .Rmd file and a knitted .html file to your GitHub site.

Statement of Integrity: Copy and paste the following statement and then sign your name (by typing it) on the line below.

“All work presented is my own. I have not communicated with or worked with anyone else on this quiz.”

Collaboration Reminder: You may not communicate with or work with anyone else on this quiz, but you may use any of our course materials or materials on the Internet.

Question 1 (7 points). Consider the following two bar plots using the palmerpenguins data set. The first is a plot of the penguin species while the second is a plot of the average bill length for each species.

library(palmerpenguins)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.4     ✓ purrr   0.3.4
## ✓ tibble  3.1.2     ✓ dplyr   1.0.6
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
ggplot(data = penguins, aes(x = species)) +
  geom_bar() +
  labs(y = "Count")

ggplot(data = penguins %>% group_by(species) %>%
         summarise(avg_length = mean(bill_length_mm, na.rm = TRUE)),
       aes(x = species, y = avg_length)) +
  geom_col() +
  labs(y = "Average Bill Length")

Which of the two graphs is appropriate to construct? Give a one sentence reason.

Because the second plot is showing a summary of continuous data, the first plot is appropriate to construct as it is just showing the difference in the number of each species in the data set.

Question 2 (9 points). Use the Happy Planet Index data set to construct a graph that does not properly show variability in the underlying data. Recall that some variables in this data set are LifeExpectancy, Wellbeing, Footprint, and Region of the world.

library(here)
## here() starts at /Users/clairedudley/Library/CloudStorage/OneDrive-St.LawrenceUniversity/Stat Classes/STAT_4005/STAT4005_Data_Viz
hpi_df <- read_csv(here("data/hpi-tidy.csv"))
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   HPIRank = col_double(),
##   Country = col_character(),
##   LifeExpectancy = col_double(),
##   Wellbeing = col_double(),
##   HappyLifeYears = col_double(),
##   Footprint = col_double(),
##   HappyPlanetIndex = col_double(),
##   Population = col_double(),
##   GDPcapita = col_double(),
##   GovernanceRank = col_character(),
##   Region = col_character()
## )
hpi_df2 <- hpi_df %>%
  group_by(Region) %>%
  summarise(meanLE = mean(LifeExpectancy)) %>%
  mutate(Region = fct_reorder(Region, meanLE))
ggplot(data = hpi_df2, aes(x = Region, y = meanLE)) + geom_col() + coord_flip()

Question 3 (7 points). Fix your graph from the previous question so that it does properly show variability in the underlying data.

hpi_df <-
  hpi_df %>%
mutate(Region = fct_reorder(Region, LifeExpectancy, .fun = median)) %>% group_by(Region) %>%
  mutate(ncountries = n())
p <- ggplot(data = hpi_df, aes(x = Region, y = LifeExpectancy)) + 
  geom_boxplot() + 
  geom_point(alpha = 0, aes(x = Region, y = LifeExpectancy,
                            text = paste0("n = ", ncountries))) +
  coord_flip()
## Warning: Ignoring unknown aesthetics: text
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
ggplotly(p, tooltip = "text") %>%
  style(hoverinfo = "skip", traces = 1) ## says to "skip" the first geom()